feat(prometheus): request/limit overlay, rightsizing, HPA + PVC + restart charts #753
nadaverell wants to merge 2 commits into
…tart charts

Expands inline Prometheus surface on resource detail pages without trying to become a Grafana replacement. Five additions, each hidden silently when its data source isn't present (no nag, no "no data" messages).

* Request/limit dashed reference lines on existing CPU + memory area charts, summed across runtime containers (excluding pure init, including native sidecars) × readyReplicas for replicated workloads.
* Rightsizing strip on Deployment/StatefulSet/DaemonSet full-screen Metrics tab. P95 over 24h with KRR-style headroom (15% CPU, 10% memory). Tone policy is mild: 2-3x headroom reads as "well-sized", only >5x mem or >8x CPU surfaces as info "could reduce". Red reserved for actual OOM risk (memory P95 >= 95% of limit); orange only for confirmed CPU throttling.
* HPA detail page gets a replicas chart (current/desired lines + min/max reference lines) via KSM.
* PVC renderer gets a single-line usage gauge via kubelet_volume_stats_*. Hidden silently when the CSI driver doesn't report or Prom isn't scraping kubelet (notably GMP default).
* Restart event lane below the Metrics chart: vertical markers on a dedicated row rather than overlaying the waveform, so clusters of restarts stay readable.

Brittleness mitigations: every new feature gates on a query that returns no series when its dependency (KSM, kubelet scrape, CSI NodeGetVolumeStats) is missing, and hides rather than rendering an error or "not configured" panel. PrometheusCharts' existing empty state still surfaces the "Discover Prometheus" CTA when nothing is connected.

Adds `HPARenderer` and `PVCRenderer` to RendererOverrides so the host can wrap them with platform data hooks without modifying the base renderers' core layout.
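The rightsizing rule in the commit message (P95 over 24h plus headroom) can be sketched roughly as follows. This is a minimal illustration, not the PR's `recommendRequest`: the function names are hypothetical, and both the "P95 × (1 + headroom)" formula and the round-up direction to 10m / 16Mi steps are assumptions based on the headroom percentages and the rounding mentioned in the test description.

```go
package main

import (
	"fmt"
	"math"
)

const (
	cpuHeadroom = 0.15 // 15% CPU headroom, from the commit message
	memHeadroom = 0.10 // 10% memory headroom, from the commit message

	// Rounding steps are assumptions (10 millicores, 16 MiB).
	cpuStepMillis = 10
	memStepBytes  = 16 * 1024 * 1024
)

// recommendCPUMillis (hypothetical) takes a P95 in cores and returns a
// request in millicores, rounded up to the next 10m step.
func recommendCPUMillis(p95Cores float64) int64 {
	withHeadroom := p95Cores * (1 + cpuHeadroom) * 1000
	return int64(math.Ceil(withHeadroom/cpuStepMillis)) * cpuStepMillis
}

// recommendMemBytes (hypothetical) takes a P95 in bytes and returns a
// request in bytes, rounded up to the next 16 MiB step.
func recommendMemBytes(p95Bytes float64) int64 {
	withHeadroom := p95Bytes * (1 + memHeadroom)
	return int64(math.Ceil(withHeadroom/memStepBytes)) * memStepBytes
}

func main() {
	// 100m P95 becomes 115m with headroom, rounded up to 120m.
	fmt.Printf("cpu: %dm\n", recommendCPUMillis(0.1))
	// 100Mi P95 becomes 110Mi with headroom, rounded up to 112Mi.
	fmt.Printf("mem: %dMi\n", recommendMemBytes(100*1024*1024)/(1024*1024))
}
```

Working in integer millicores and bytes after the rounding step keeps the recommendation free of floating-point noise when it is formatted for display.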
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 802c80d.
`ratio := reqVal / p95`
Division by zero when P95 usage is zero
Medium Severity
When `p95` is exactly zero (e.g., an idle container with no CPU usage over 24h), `ratio := reqVal / p95` produces `+Inf`. This flows into `fmt.Sprintf("Over-provisioned by %.1fx — could reduce", ratio)`, which renders as "Over-provisioned by +Infx — could reduce" in the user-facing rightsizing strip. A guard for `p95 <= 0` (or near-zero) before computing the ratio would prevent this broken display.
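One possible shape for the suggested guard is sketched below. The helper name `overProvisionedLabel`, its signature, and the `minUsage` cutoff are assumptions for illustration; only the format string comes from the review comment.

```go
package main

import (
	"fmt"
	"math"
)

// overProvisionedLabel is a hypothetical helper. It returns ok=false when
// no meaningful ratio exists, so the caller can hide the badge instead of
// rendering "+Infx" or "NaNx".
func overProvisionedLabel(reqVal, p95 float64) (label string, ok bool) {
	const minUsage = 1e-9 // assumed near-zero cutoff
	if math.IsNaN(reqVal) || math.IsNaN(p95) || p95 <= minUsage {
		return "", false
	}
	ratio := reqVal / p95
	if math.IsInf(ratio, 0) {
		return "", false
	}
	return fmt.Sprintf("Over-provisioned by %.1fx — could reduce", ratio), true
}

func main() {
	if label, ok := overProvisionedLabel(2, 0.25); ok {
		fmt.Println(label)
	}
	if _, ok := overProvisionedLabel(1, 0); !ok {
		fmt.Println("idle container: no ratio shown")
	}
}
```

Returning a boolean rather than a placeholder string leaves the decision about what to render for idle containers to the caller.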
* Reference line scale: PromQL is per-pod, drop the readyReplicas multiplier so the line lands on the same axis as the chart's per-pod series.
* RightsizingResponse.Rows: initialize as []RightsizingRow{} in the no-containers branch so the wire format matches the TS contract (non-null).
* loadWorkloadContainers: return sentinel errors so the handler can map cache-not-ready to 503 and RBAC-denied (nil lister) to 403 instead of reporting both as 404.
* PVC usage handler: log Prom query failures via errorlog so operators can distinguish "Prom unhealthy" from "CSI doesn't report".
* RBAC gate: new SetAuthGate hook in the prometheus package; server wires it to canRead. The new rightsizing + PVC endpoints now require the caller to be able to "get" the underlying resource before reading from the SA-populated informer cache.
* Restart event lane: emit a marker only when the rolling-window count increases (or first-sample-nonzero). The previous "every positive sample" rule turned one restart into ~60 markers + a ~60× total.
* readQuantity: NaN guards on the suffix paths so malformed YAML can't poison the request/limit sum.
* HPACharts + RestartEventLane: consume the React Query error field so Prom-side failures get a console.warn breadcrumb instead of silently looking identical to no-data.
* PVCUsageBar: paired light/dark text tones (text-red-700 / text-red-400) so the percentage stays legible in both themes.
* Comment drift: "Pods are too granular" reworded to match the gate; PVC label-fallback comment trimmed to match what the code actually does.
* Add rightsizing_test.go: tone-classifier boundaries (3x/5x/8x ratios, 0.95 mem-OOM threshold), recommendRequest 10m/16Mi rounding, native-sidecar inclusion, formatRightsizingValue edge cases.
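The restart event lane rule from the list above (emit a marker only when the rolling-window count increases, or when the first sample is nonzero) might look roughly like this. The `Sample` and `Marker` types and the `restartMarkers` function are hypothetical; the PR's actual logic may live in the frontend, and Go is used here only as an illustration.

```go
package main

import "fmt"

// Sample is one point of a changes(...[1h]) range query (assumed shape).
type Sample struct {
	T int64   // unix seconds
	V float64 // restarts counted inside the rolling window at time T
}

// Marker is one vertical tick on the restart lane (assumed shape).
type Marker struct {
	T     int64
	Count int
}

// restartMarkers emits a marker only when the rolling-window count rises
// above the previous sample. Starting prev at 0 covers the
// first-sample-nonzero case. Decreases (restarts ageing out of the
// window) update prev without emitting anything.
func restartMarkers(samples []Sample) []Marker {
	var out []Marker
	prev := 0.0
	for _, s := range samples {
		if s.V > prev {
			out = append(out, Marker{T: s.T, Count: int(s.V - prev)})
		}
		prev = s.V
	}
	return out
}

func main() {
	// One restart that stays in the window for three samples, then two more.
	samples := []Sample{
		{T: 0, V: 0}, {T: 60, V: 1}, {T: 120, V: 1}, {T: 180, V: 1},
		{T: 240, V: 0}, {T: 300, V: 2}, {T: 360, V: 2},
	}
	for _, m := range restartMarkers(samples) {
		fmt.Printf("t=%d restarts=%d\n", m.T, m.Count)
	}
}
```

With the earlier "every positive sample" rule, the same input would have produced five markers instead of two, which is the inflation the fix describes.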


Summary
Expands inline Prometheus surface on resource detail pages without trying to become a Grafana replacement. Five additions, all gated on graceful degradation — each is hidden silently when its data source isn't present (no nag, no "no data" panels).
1. Request/limit overlay on CPU + memory charts. Dashed reference lines on the existing area chart, summed across runtime containers (pure init excluded, native sidecars with `restartPolicy: Always` included) × `readyReplicas` for replicated workloads. The chart's Y axis auto-extends to fit the limit line so it doesn't clip when usage runs hot.
2. Rightsizing strip on Deployment / StatefulSet / DaemonSet full-screen Metrics tab. Backed by a new `/prometheus/rightsizing/{kind}/{ns}/{name}` endpoint that issues P95-over-24h subqueries per container. Tone policy is deliberately mild; most workloads are 2–3× over-provisioned and that's fine:
   * Well-sized, no badge
   * Nx headroom, neutral
   * info "could reduce" (muted blue, not red)
   * warning "consider setting"
   * alert "throttling likely" (orange)
   * critical "OOM risk" (red, the only red case)
3. HPA replicas chart. Bottom of HPA detail. Current/desired as two-line chart, min/max as reference lines. Sourced from `kube_horizontalpodautoscaler_status_{current,desired}_replicas`. Observed-CPU-vs-target chart deferred: KSM doesn't expose the observed value reliably across versions.
4. PVC usage gauge. Single-line capacity bar on the PVC renderer via `kubelet_volume_stats_{used,capacity}_bytes`. Traffic-light tone at 75%/90%. Hides silently when CSI doesn't `NodeGetVolumeStats` or Prom isn't scraping kubelet (notably GMP default).
5. Restart event lane. Below the Metrics chart on workload + Pod detail. Vertical markers on a dedicated row rather than overlaying the chart waveform, so clusters of restarts stay readable. Uses `changes(kube_pod_container_status_restarts_total[1h])`; hidden when KSM isn't reporting.

Brittleness
Every new feature gracefully degrades:
* `NodeGetVolumeStats` (or k8s 1.34 volume-stats regression) → PVC gauge hides

Architecture notes
* Adds `HPARenderer` and `PVCRenderer` to `RendererOverrides` so the host can wrap them with Prom-backed sections via `extraSections` without touching the base renderers' core layout.
* `MetricsTabContent` composes RightsizingStrip + PrometheusCharts + RestartEventLane on the workload Metrics tab. RightsizingStrip is gated on `expanded` so it doesn't appear in drawer mode (wrong granularity for "what is this" view).
* `PrometheusCharts`.

Test plan
* `go build ./...`, `go vet ./...`, `npm run tsc`, `make build` all clean
* `go test ./internal/prometheus/...` passes
* `NodeGetVolumeStats` to confirm it hides

Note
Medium Risk
Adds new Prometheus-backed API endpoints and UI surfaces (rightsizing recommendations, PVC usage, restart/HPA charts) plus request/limit overlays, which could impact performance and correctness of metrics and involves RBAC gating on cached K8s specs.
Overview
Expands the Prometheus feature set with two new backend endpoints: `/prometheus/pvc/{namespace}/{name}` (PVC usage from `kubelet_volume_stats_*`) and `/prometheus/rightsizing/{kind}/{namespace}/{name}` (per-container CPU/memory P95-based request recommendations), both protected by a new request-scoped `AuthGate` wired from `server.canRead`.

Adds a new Prometheus metric category `restarts` (PromQL `changes(kube_pod_container_status_restarts_total[1h])`) and updates the web UI to surface restart event markers, an HPA replicas-over-time chart, and a PVC usage gauge, all designed to hide when Prometheus/KSM/kubelet series are unavailable.

Enhances existing Prometheus CPU/memory charts with request/limit reference-line overlays computed from the resource's pod spec (including native sidecars, excluding pure init containers), and adds a workload metrics header strip showing rightsizing recommendations for supported workload kinds.
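The container selection behind the request/limit overlay (native sidecars included, pure init containers excluded) can be illustrated with a small sketch. The `Container` type and `sumRuntimeCPU` are simplified stand-ins, not the Kubernetes API types or the PR's code, and treating a missing limit as 0 is a simplification.

```go
package main

import "fmt"

// Container is a simplified stand-in for a pod-spec container.
type Container struct {
	Name             string
	Init             bool   // listed under initContainers
	RestartPolicy    string // "Always" on an init container marks a native sidecar
	CPURequestMillis int64
	CPULimitMillis   int64 // 0 means "no limit set" in this sketch
}

// isRuntime reports whether a container runs alongside the workload:
// regular containers and native sidecars, but not pure init containers.
func isRuntime(c Container) bool {
	return !c.Init || c.RestartPolicy == "Always"
}

// sumRuntimeCPU adds up CPU requests and limits for runtime containers.
func sumRuntimeCPU(containers []Container) (requestMillis, limitMillis int64) {
	for _, c := range containers {
		if !isRuntime(c) {
			continue
		}
		requestMillis += c.CPURequestMillis
		limitMillis += c.CPULimitMillis
	}
	return requestMillis, limitMillis
}

func main() {
	containers := []Container{
		{Name: "app", CPURequestMillis: 250, CPULimitMillis: 500},
		{Name: "proxy", Init: true, RestartPolicy: "Always", CPURequestMillis: 50, CPULimitMillis: 100},
		{Name: "migrate", Init: true, CPURequestMillis: 1000, CPULimitMillis: 1000},
	}
	req, lim := sumRuntimeCPU(containers)
	fmt.Printf("request=%dm limit=%dm\n", req, lim)
}
```

The `migrate` container is skipped because it exits before the workload starts, so counting its request would push the reference line above what the running pod actually reserves.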
Reviewed by Cursor Bugbot for commit c80200f. Bugbot is set up for automated code reviews on this repo.